Support raw HSACO custom kernargs#12
Conversation
Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.
|
CI caught an error. |
Co-authored-by: Ben Vanik <ben.vanik@gmail.com>
Hoist raw HSACO kernel iteration into executable creation so the resolver only writes a single host kernel-args record.
Keep implicit-argument layout metadata in the dispatch kernarg layout instead of duplicating it in the hot kernel-args table.
benvanik
left a comment
There was a problem hiding this comment.
Second-pass review notes. I agree this is moving in a better direction overall, but I think there are still a few contract issues that need tightening before this is safe to merge. Most of them come from one pressure point: the raw/native kernarg model is starting to share helpers with the regular HAL/flatbuffer model, and those two projections need to stay separate.
1. Raw kernarg reflection leaks into flatbuffer executable loading
The new default HSACO reflection projection computes constant_count from the full aligned native kernarg segment:
That helper is still used by the flatbuffer executable path, which validates it against the flatbuffer ExportDef.constant_count:
For a normal wrapped executable with two 64-bit bindings and two 32-bit constants, the native kernarg segment is 24 bytes, so this projects constant_count == 6. The flatbuffer ABI still says there are 2 constants. That means the flatbuffer path can reject valid wrapped executables because the raw-HSACO projection is now pretending padding and binding bytes are HAL constants.
The same leak shows up in parameter offsets. The public HAL contract says binding parameter offsets are binding ordinals:
But populate_default_export_parameters now writes raw kernarg byte offsets for bindings:
Suggested fix: split this into two explicit projections instead of trying to make one "default" helper mean both things.
- Flatbuffer/HAL projection: compact visible metadata into HAL ABI reflection, with binding offsets as binding ordinals and constants as compact byte offsets/counts. This is what the wrapped executable path validates against
ExportDef. - Raw HSACO/native projection: preserve native kernarg offsets and expose enough byte coverage for custom-direct prepacked argument blobs. This should be used only by the raw HSACO executable path.
That split also gives the names room to state the invariant. Something like calculate_hal_export_parameter_requirements versus calculate_native_kernarg_export_parameter_requirements would make accidental cross-use much harder.
2. Custom-direct validation accepts truly short argument buffers for the common no-implicit-suffix case
iree_hal_amdgpu_executable_initialize_dispatch_descriptor leaves custom_kernarg_layout as all-zero unless an implicit suffix is detected:
Both planners then use custom_kernarg_layout.explicit_kernarg_size as the required minimum:
When there is no implicit suffix, that required size is zero. So the relaxed validation accepts any custom direct blob length, including one that is actually shorter than the declared kernel kernarg segment. The writer then uses the caller's length as the dynamic layout size:
That is looser than the intended "allow missing trailing ABI padding" rule. It can accept a genuinely short prepacked buffer and dispatch with missing argument bytes.
Suggested fix: even for dynamic custom-direct layouts, preserve a validated minimum explicit byte count from metadata/symbol information. The dispatch-time rule should be roughly:
iree_host_align(constants.data_length, 8) >= layout->explicit_kernarg_sizeThen the writer may clamp extra trailing bytes and zero missing padding up to the reservation. The dynamic part can still mean "the exact reservation may be caller-sized when no implicit suffix exists", but the minimum legal byte count should not be zero unless the kernel really has a zero-byte kernarg ABI.
3. Host-queue custom-direct block count truncates before the safety check
The host queue path computes a runtime block count from constants.data_length and immediately casts it to uint32_t:
The ring-capacity check later sees only the truncated value:
But the actual emplace path still memset/memcpys the full constants.data_length:
So an oversized iree_host_size_t length can wrap to a small block count, pass validation, reserve too few queue kernarg blocks, and then overrun the queue-owned kernarg allocation. This is security/correctness-sensitive because the direct host-queue path is not protected by the command-buffer qword length guards.
Suggested fix: keep the ceil-div result in iree_host_size_t, reject values greater than UINT32_MAX, then cast. It would be worth sharing a tiny checked helper with executable descriptor initialization so every path that converts kernarg byte length to block count has the same overflow behavior.
4. ELF symbol augmentation can read past malformed section bounds
The new synthetic kernel augmentation scans ELF symbol sections, but the appended-symbol pass checks only that symbol_section_offset <= elf_data.data_length:
It does not check symbol_section_size <= elf_data.data_length - symbol_section_offset. The first thing iree_hal_amdgpu_hsaco_metadata_is_elf_kernel_symbol does is read symbol[4]:
The second pass repeats the same unchecked range pattern:
So malformed ELF section metadata can produce out-of-bounds symbol reads before the later symbol-name helper gets a chance to reject the symbol.
Suggested fix: use the full range check used elsewhere in this file before deriving symbol_count, including multiplication/addition overflow checks for symbol_count * symbol_entry_size. If the section is malformed, skip it or fail loudly, but the code should never form a symbol pointer outside the ELF byte span.
5. append_elf_kernel_symbols leaks expansion allocations on failure
After the expansion storage is allocated:
a failure allocating extra_symbol_name_storage leaks expanded.kernels/args:
Later IREE_RETURN_IF_ERROR exits during the second scan can leak both allocations, because the expanded buffers have not yet been installed into metadata and the caller's cleanup only sees the original metadata object:
Suggested fix: make the expansion path a status-gated block with one terminal cleanup section. Until ownership is committed into metadata, expanded.kernels and expanded.owned_string_storage need local cleanup on every error path.
Larger design concern: synthetic ELF kernels need to be quarantined/opted in
This helper now invents metadata for ELF function symbols that do not have metadata entries by picking one template kernel and copying its ABI/workgroup metadata:
That is a very strong producer-specific assumption. It may match a Tensile/hipBLASLt packaging pattern, but it is not generally true for arbitrary raw HSACO: a code object can contain multiple kernels with different argument layouts, required workgroup sizes, group/private segment sizes, or hidden suffix behavior. Publishing synthetic exports with copied metadata makes those kernels look valid until dispatch, where the failure mode is wrong argument layout rather than a clean load-time error.
Suggested fix: quarantine this behind an explicit opt-in. A few possible shapes:
- Put the synthetic-kernel expansion behind a raw-HSACO executable flag/option that names the producer assumption, instead of doing it unconditionally in
iree_hal_amdgpu_hsaco_metadata_initialize_from_elf. - Restrict it to a narrow, validated pattern: exactly one metadata template kernel, extra ELF kernel symbols known to share the same ABI by producer contract, and ideally an executable parameter that says HRX is intentionally loading that bundle form.
- Keep the generic parser honest: by default, only kernels with actual AMDGPU metadata should become dispatchable exports. Missing metadata should mean "not reflectable/not dispatchable by this generic path", not "borrow an arbitrary template".
If this expansion stays, tests should include at least one negative case with two real metadata kernels of different ABI plus one extra symbol, so the loader proves it does not silently clone the wrong contract.
|
(my prompt to gpt5.5-xhigh was: |
Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.